
Implementation of max_batches parameter. #1087

Merged

merged 7 commits into main from 1005-add-max-batches-to-automl-search on Aug 21, 2020

Conversation

freddyaboulton (Contributor):

Pull Request Description

Fixes #1005 by adding a max_batches parameter to AutoMLSearch. When both max_pipelines and max_batches are set, max_pipelines will take precedence.
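For context, a minimal usage sketch of the new parameter (the dataset loader and the batch count are illustrative, and later commits in this PR rename the parameter to _max_batches):

    from evalml import AutoMLSearch
    from evalml.demos import load_breast_cancer  # any binary classification dataset works here

    X, y = load_breast_cancer()

    # Stop after three batches of pipelines; max_pipelines is unset, so
    # max_batches alone bounds the search.
    automl = AutoMLSearch(problem_type="binary", max_batches=3)
    automl.search(X, y)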



codecov bot commented Aug 20, 2020

Codecov Report

Merging #1087 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main    #1087   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files         192      192           
  Lines       10853    10894   +41     
=======================================
+ Hits        10842    10883   +41     
  Misses         11       11           
Impacted Files                              Coverage Δ
evalml/automl/automl_search.py              99.56% <100.00%> (+<0.01%) ⬆️
evalml/tests/automl_tests/test_automl.py    100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -76,7 +76,8 @@ def __init__(self,
                 n_jobs=-1,
                 tuner_class=None,
                 verbose=True,
-                optimize_thresholds=False):
+                optimize_thresholds=False,
+                max_batches=None):
freddyaboulton (Contributor Author):

Not adding to docstring because we want to keep this "hidden" for now.

Contributor:

Yep, thanks. Will make it easier to make breaking changes when we next update the automl search UX.

Contributor:

Hm, after thinking on it for a bit, my preference would be to name this _max_batches and add a docstring entry, rather than simply omitting it from the docstring. Not a blocker.

Contributor:

I agree with this, adding the docstring now will also help us with development and keep track of what this is for!

freddyaboulton (Contributor Author):

I'll rename it to _max_batches and add it to the docstring!

dsherry (Contributor) left a comment:

Awesome! This is going to make the perf testing a lot easier :)

I left a comment on making the parameter private as opposed to unlisted (feel free to push back if you disagree), and some minor test suggestions.


@@ -166,7 +167,7 @@ def __init__(self,
             raise TypeError("max_time must be a float, int, or string. Received a {}.".format(type(max_time)))

         self.max_pipelines = max_pipelines
-        if self.max_pipelines is None and self.max_time is None:
+        if self.max_pipelines is None and self.max_time is None and max_batches is None:
Contributor:

👍

self._max_batches = max_batches
# This is the default value for IterativeAlgorithm - setting this explicitly makes sure that
# the behavior of max_batches does not break if IterativeAlgorithm is changed.
self._pipelines_per_batch = 5
Contributor:

Ah got it, thanks.

@@ -365,6 +373,8 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl

         if self.allowed_pipelines == []:
             raise ValueError("No allowed pipelines to search")
+        if self._max_batches and self.max_pipelines is None:
+            self.max_pipelines = 1 + len(self.allowed_pipelines) + (self._pipelines_per_batch * (self._max_batches - 1))
Contributor:

Makes sense. When we change our automl algorithm we'll have to remember to update this. I spent some time considering if there's a way to have IterativeAlgorithm compute this for us, but I think this is fine.

Contributor:

Can I clarify this calculation so I better understand? Is this 1 (baseline) + len(self.allowed_pipelines) for first batch + (self._pipelines_per_batch * (self._max_batches - 1)) for each batch thereafter?

freddyaboulton (Contributor Author):

Yes you got it @angela97lin !

Contributor:

ah gotcha, thanks @freddyaboulton! 😊
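To make the arithmetic concrete, a worked example (the pipeline count here is an assumption for illustration):

    # Hypothetical numbers: 6 allowed pipelines, 5 pipelines per batch
    # after the first, max_batches=3.
    allowed_pipelines = 6
    pipelines_per_batch = 5
    max_batches = 3

    # 1 baseline, plus one pipeline per allowed class in the first batch,
    # plus full batches thereafter.
    max_pipelines = 1 + allowed_pipelines + pipelines_per_batch * (max_batches - 1)
    print(max_pipelines)  # 1 + 6 + 5 * 2 = 17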

@@ -377,7 +387,8 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
                 tuner_class=self.tuner_class,
                 random_state=self.random_state,
                 n_jobs=self.n_jobs,
-                number_features=X.shape[1]
+                number_features=X.shape[1],
+                pipelines_per_batch=self._pipelines_per_batch
Contributor:

👍

n_results = 1 + len(automl.allowed_pipelines) + (5 * (max_batches - 1))

assert automl._automl_algorithm._batch_number == max_batches
assert len(automl._results["pipeline_results"]) == n_results
Contributor:

Could you check automl.rankings and automl.full_rankings? Also please use the public accessor for results, i.e. automl.results

freddyaboulton (Contributor Author):

Good call!
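A sketch of what the suggested checks might look like (the rankings behavior is assumed here: full_rankings lists every evaluated pipeline including the baseline, while rankings keeps only the best result per pipeline class):

    # Public accessor instead of automl._results:
    assert len(automl.results["pipeline_results"]) == n_results
    assert len(automl.full_rankings) == n_results
    # rankings is deduplicated per pipeline class, so its length is bounded
    # by the number of allowed pipelines plus the baseline.
    assert len(automl.rankings) <= 1 + len(automl.allowed_pipelines)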

# So that the test does not break when new estimator classes are added
n_results = 1 + len(automl.allowed_pipelines) + (5 * (max_batches - 1))

assert automl._automl_algorithm._batch_number == max_batches
Contributor:

Yeah we don't surface this publicly right now. You're reminding me we should eventually pin down an API for exposing this sort of information.

Please do:

assert automl._automl_algorithm.batch_number == max_batches
assert automl._automl_algorithm.pipeline_number == n_results

freddyaboulton (Contributor Author):

Thanks for the tips! I was looking at IterativeAlgorithm instead of the base class.
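For reference, a sketch of what the base-class accessors referenced above presumably look like (names follow the suggestion; the actual evalml implementation may differ):

    class AutoMLAlgorithm:
        def __init__(self):
            self._batch_number = 0
            self._pipeline_number = 0

        @property
        def batch_number(self):
            # Number of batches the algorithm has produced so far.
            return self._batch_number

        @property
        def pipeline_number(self):
            # Number of pipelines the algorithm has produced so far.
            return self._pipeline_number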


@patch('evalml.pipelines.BinaryClassificationPipeline.score', return_value={"Log Loss Binary": 0.8})
@patch('evalml.pipelines.BinaryClassificationPipeline.fit')
def test_max_batches_plays_nice_with_other_stopping_criteria(mock_fit, mock_score, X_y_binary):
Contributor:

Haha nice

def test_max_batches_must_be_non_negative(max_batches):
    with pytest.raises(ValueError, match=f"Parameter max batches must be None or non-negative. Received {max_batches}."):
        AutoMLSearch(problem_type="binary", max_batches=max_batches)
Contributor:

Wonderful!
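For reference, a self-contained version of this test might look like the following (the parametrized values and imports are assumptions; the diff hunk elides them):

    import pytest
    from evalml import AutoMLSearch

    # Hypothetical parametrization; the PR's actual values are not shown in this hunk.
    @pytest.mark.parametrize("max_batches", [-1, -10])
    def test_max_batches_must_be_non_negative(max_batches):
        with pytest.raises(ValueError, match=f"Parameter max batches must be None or non-negative. Received {max_batches}."):
            AutoMLSearch(problem_type="binary", max_batches=max_batches)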


angela97lin (Contributor) left a comment:

LGTM! Was curious about implementation for my own understanding but the impl looks great, tests are fantastic :D



freddyaboulton force-pushed the 1005-add-max-batches-to-automl-search branch from 1ad0771 to 823714f on August 21, 2020 20:03
freddyaboulton (Contributor Author):

@dsherry I addressed your comments! Thanks for the feedback.

@@ -76,7 +76,8 @@ def __init__(self,
                 n_jobs=-1,
                 tuner_class=None,
                 verbose=True,
-                optimize_thresholds=False):
+                optimize_thresholds=False,
+                _max_batches=None):
Contributor:

Thanks, I think this is a good pattern for us to avoid breaking changes but still communicate clearly to users 👍

@@ -129,6 +130,8 @@ def __init__(self,
             None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.

         verbose (boolean): If True, turn verbosity on. Defaults to True

+        _max_batches (int): The maximum number of batches of pipelines to search.
Contributor:

Great. Could also mention that max_pipelines and max_time take precedence, though that's not required.

@@ -129,6 +130,9 @@ def __init__(self,
             None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.

         verbose (boolean): If True, turn verbosity on. Defaults to True

+        _max_batches (int): The maximum number of batches of pipelines to search. Parameters max_time and
+            max_pipelines take precedence over _max_batches in stopping the search.
Contributor:

Thanks!

dsherry (Contributor) left a comment:

🚢 💨

freddyaboulton merged commit 6cd61ff into main on Aug 21, 2020
freddyaboulton deleted the 1005-add-max-batches-to-automl-search branch on August 21, 2020 22:04
dsherry mentioned this pull request on Aug 25, 2020
Linked issue (#1005): Add max_batches parameter to automl search object